Abstract

Developers face a wide choice of programming languages and libraries supporting multi- 
core computing. Ever more diverse paradigms for expressing parallelism and synchronization 
become available while their influence on usability and performance remains largely unclear. 
This paper describes an experiment comparing four markedly different approaches to parallel 
programming: Chapel, Cilk, Go, and Threading Building Blocks (TBB). Each language is 
used to implement sequential and parallel versions of six benchmark programs. The imple- 
mentations are then reviewed by notable experts in the language, thereby obtaining reference 
versions for each language and benchmark. The resulting pool of 96 implementations is used 
to compare the languages with respect to source code size, coding time, execution time, and 
speedup. The experiment uncovers strengths and weaknesses in all approaches, facilitating 
an informed selection of a language under a particular set of requirements. The expert review 
step furthermore highlights the importance of expert knowledge when using modern parallel 
programming approaches. 

1 Introduction 

The industry-wide shift to multicore processors is expected to be permanent because of the 
physical constraints preventing frequency scaling [2]. Since coming hardware generations will 
not make programs much faster unless they harness the available multicore processing power, 
parallel programming is gaining much importance. At the same time, parallel programs are 
notoriously difficult to develop. On the one hand, concurrency makes programs prone to errors 
such as atomicity violations, data races, and deadlocks, which are hard to detect because of 
their nondeterministic nature. On the other hand, performance is a significant challenge, as 
scheduling and communication overheads or lock contention may lead to adverse effects, such 
as parallel slow down. 

In response to these challenges, a plethora of advanced programming languages and li- 
braries have been designed, promising an improved development experience over traditional 
multithreaded programming, without compromising performance. These approaches are based 
on widely different programming abstractions, synchronization mechanisms, and programming 
paradigms. A lack of results that convincingly characterize both usability and performance of 
these approaches makes it difficult for developers to confidently choose among them. Current 
evaluations are typically based on classroom studies, e.g. EQl O I3 ; however, the use of novice 
programmers imposes serious obstacles to transferring experimental results into practice. 

This paper presents an experiment to compare multicore languages and libraries, applied 
to four approaches: Chapel [6], Cilk [3], Go [11], and Threading Building Blocks (TBB) [19].

*The ideas and opinions presented here are those of the authors, and not necessarily those of my employer. 
The approaches are selected to cover a range of programming abstractions for parallelism, are
all under current active development, and are backed by large corporations. The experiment uses
a two-step process for obtaining the program artifacts, avoiding problems of other comparison 
methodologies. First, an experienced developer implements both a sequential and a parallel 
version of a suite of six benchmark programs. Second, experts in the respective language review 
the implementations, leading to a set of reference versions. All experts participating in our 
experiment are high-profile, namely either leaders or prominent members of the respective com- 
piler development teams. This process leads to a solution pool of 96 programs, i.e. six problems 
in four languages, each in a sequential and a parallel version, and before and after expert review. 
This pool is subjected to program metrics that relate both to usability (source code size and 
coding time) and performance (execution time and speedup). 

The experiment then statistically relates, for each metric, the approaches to each other. 
The results can thus provide guidance for choosing a suitable language according to usability 
and performance criteria. Furthermore, the expert review step helps quantify the influence of 
expert knowledge for obtaining high-quality programs, and provides insights into the design and 
implementation choices taken in specific languages. 

The remainder of this paper is structured as follows. Section [2] describes the experimental 
design of the study. Section [3] provides an overview of the approaches chosen for the experiment. 
Section [4] discusses the results of the experiment. Section [5] presents related work and Section [6]
concludes with an outlook on future work. 

2 Experimental design 

This section presents the research questions addressed by the study and describes the design of 
the experiment to answer them. 

2.1 Research questions 

Approaches to multicore programming are very diverse. To begin with, they are rooted in one 
basic programming paradigm such as imperative, functional, and object-oriented programming 
or multi-paradigmatic combinations of these. A further distinction is given by the communi- 
cation paradigm used, such as shared memory and message-passing or their hybrids. Lastly, 
they differ in the programming abstractions chosen to express parallelism and synchronization,
such as Fork-Join, Algorithmic Skeletons [7], Communicating Sequential Processes (CSP) [12],
or Partitioned Global Address Space (PGAS) [1] mechanisms.

In spite of this diversity, all multicore languages share two common goals: to provide im- 
proved language usability by offering advanced programming abstractions and run-time mecha- 
nisms, while facilitating the development of programs with a high level of performance. While 
usability and performance are natural goals of any programming language, both aspects are 
particularly relevant in the case of languages for parallelism. First, achieving only average per- 
formance is not an option: parallel languages are employed precisely because of performance 
reasons. Second, usability is crucial because of the claim to be able to replace traditional 
threading models, which are ill-reputed precisely because of their lack of usability: unrestricted 
nondeterminism and the use of locks are among the aspects branded as error-prone. 

A comparative study of parallel languages has to evaluate both usability and performance 
aspects. Hence, the abstract research questions are: 

Usability How easy is it to write a parallel program in language L? 
Performance How efficient is a parallel program written in language L?

While correctness of the program is without doubt the most important dimension, it must be 
treated as a prerequisite to both the usability and performance evaluation if they are to make 
sense. Therefore we measure the usability of obtaining a program that solves a problem P 
correctly, and the performance of that program. 

The abstract research questions need to be translated into concrete ones that correspond to 
measurable data. In this experiment, the following concrete research questions are investigated: 

Source code size What is the size of the source code, measured in lines of code (LoC), of a
solution to problem P in language L?

Coding time How much time does it take to code a solution to problem P in language L?

Execution time How much time does it take to execute a solution to problem P in language 
L if n processor cores are available? 

Speedup What is the speedup of a parallel solution to problem P in language L over the fastest 
known sequential solution to problem P in language L if n processor cores are available? 

We argue that both source code size and coding time strongly relate to usability. Shorter cod- 
ing time can be due to many factors: availability of powerful language constructs that simplify 
the implementation; better support of the developer's reasoning, resulting in fewer iterations to 
obtain a correct program; availability of better documentation, examples, or development tools, 
reducing the time the programmer spends stuck on a language issue; and others. While the
measurement of coding time abstracts from such specific reasons, it is clear that it is directly related
to these usability issues. While usability might also be captured by qualitative methods, e.g. 
asking about the perceived benefit of using a particular approach, we opted in this experiment 
to deal with quantifiable data only. 

Execution time and speedup relate to performance. Measuring speedup in addition to ex- 
ecution time gives important benefits: being a relative measure, it factors out performance 
deficiencies also present in the sequential base language, drawing attention to the power of the 
parallel mechanisms offered by the language; it also reveals scalability problems. 

From the concrete research questions, it is apparent that the experimental setup has to 
provide for both a set L of languages as well as a set P of parallel programming problems. 
Sections 2.2 and 2.3 explain how these sets were chosen.

2.2 Selection of languages 

The set L of languages considered in the experiment, further discussed in Section 3, was selected
in the following manner. We first collected parallel programming approaches using web search
and surveys, e.g. [21], resulting in 120 languages and libraries. We then applied two requirements
to narrow down the approaches:

• The approach is under active development. This criterion was critical for the study, as it 
ensures that notable experts in the language are available during the expert review phase. 

• The approach has gained some popularity. This criterion was used to ensure that the 
results obtained by the experiment stay relevant for a longer time. Backing by a large 
corporation and a substantial user base were taken as signs of popularity. 
The first criterion was by far the most important one, eliminating 73% of approaches, as many
languages were academic or industrial experiments that appeared to have been discontinued. 
Among the remaining approaches, about half of them were considered popular approaches. In 
the final selection, we preferred those approaches which would add to the variety of programming 
paradigms, communication paradigms, and/or programming abstractions considered. Well es- 
tablished approaches such as OpenMP and MPI (industry de-facto standards for shared memory 
and message passing parallelism) were not considered, as we wanted to focus on cutting-edge 
approaches. 

2.3 Benchmark problems 

The set P of parallel programming problems was chosen from already suggested problem sets 
in the literature, e.g. [231 UHl El]- Reusing a tried and tested set has the benefit that estimates 
for the implementation complexity exist and that problem selection bias can be avoided by the 
experimenter. 

After consideration, we chose the second version of the so-called Cowichan problems [24]
(first version in [23]) as benchmarks, for the following reasons. First, the problems cover
a wide range of parallel programming patterns, which is crucial to a comparative study. Second,
given that we study four approaches in the experiment, it was important to keep the amount 
of time spent with every single problem reasonably small. The chosen problem set has been 
designed for this purpose [23]; in order to be more representative of large applications, the 
problems can however also be chained together. 

In [24], 13 such problems are presented. Again to keep the number of implementations
manageable within the experiment, we selected the following six out of these:

• Random number generation (randmat) 

• Histogram thresholding (thresh) 

• Weighted point selection (winnow) 

• Outer product (outer) 

• Matrix-vector product (product)

• Chaining of problems (chain) 

Note that the last problem, chain, corresponds to a chaining together of the inputs and outputs 
of the other five. 

2.4 Implementation 

The benchmark problems were implemented in the chosen approaches observing the following 
practical considerations: 

• Use an experienced developer, but with no previous exposure to any of the chosen
approaches, to implement all problems.

• Confirm that all solutions are correct by regression testing with predefined inputs and 
oracles. 

• Confirm that all parallel versions show some speedup over the respective sequential ver- 
sions. 



• Use a version control system to measure the time to produce a solution through the 
commit times. Commit messages of the form "language-problem-variant keyword" were 
used, where language ∈ L, problem ∈ P, variant ∈ {seq, par, expertseq, expertpar}, and
keyword ∈ {start, pause, resume, done}.
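The commit-message convention above lends itself to a mechanical computation of coding time. The following is a minimal sketch in Python, assuming the commit log has already been exported as chronologically ordered (timestamp, message) pairs; the function name `coding_times` and the sample log are hypothetical and not taken from the study's tooling.

```python
from datetime import datetime, timedelta

def coding_times(log):
    """Sum the active work intervals per solution.

    `log` is a chronologically ordered list of (timestamp, message) pairs,
    where each message has the form "language-problem-variant keyword".
    An interval opens on "start" or "resume" and closes on "pause" or "done".
    """
    totals = {}   # solution -> accumulated timedelta
    running = {}  # solution -> timestamp at which the open interval began
    for stamp, message in log:
        solution, keyword = message.rsplit(" ", 1)
        if keyword in ("start", "resume"):
            running[solution] = stamp
        elif keyword in ("pause", "done") and solution in running:
            elapsed = stamp - running.pop(solution)
            totals[solution] = totals.get(solution, timedelta()) + elapsed
    return totals

# Hypothetical log for one solution: 90 minutes of work, a break, then 45 minutes.
example_log = [
    (datetime(2013, 1, 7, 9, 0), "go-randmat-par start"),
    (datetime(2013, 1, 7, 10, 30), "go-randmat-par pause"),
    (datetime(2013, 1, 7, 13, 0), "go-randmat-par resume"),
    (datetime(2013, 1, 7, 13, 45), "go-randmat-par done"),
]
```

Only the time between an opening and a closing keyword is counted, so breaks between "pause" and "resume" do not inflate the measurement.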

We use an experienced developer (six years of experience; working at Google Inc.), rather than 
novice programmers, because of known issues with classroom approaches: parallel programs are 
hard to get right and to get to perform well, which makes the use of inexperienced programmers 
questionable. For example, Ebcioglu et al. [8] report that about one third of their students could 
not successfully complete a correct solution that achieved any speedup, regardless of which of 
the three languages in their experiment was used; while this underlines the difficulty of parallel 
programming, it would be problematic to use such results to evaluate the languages. 

Instead of a single experienced developer, one could also use a group of developers. The 
main problem with this approach is of a practical nature, in particular recruiting and budget. 
We decided instead to combine the use of a single developer with an expert review step (see
Section 2.5), which has additional advantages.



2.5 Expert review 

As a key step in the study, experts in the respective languages were asked to review the initial 
implementations. This expert review had two main rounds. In the first round, the language 
designers or leaders of the respective development teams were contacted by email and asked for 
their help. Links to individual solutions in a browsable online repository were provided and brief 
instructions for the code review given, calling for any kind of feedback but especially on a) ways 
to make the implementations more concise or elegant and b) ways to improve their performance. 
In all cases, the initial contacts either provided comments themselves or recommended a member 
of their team, leading to the following list of experts: 

• Chapel: Brad Chamberlain, Principal Engineer at Cray Inc. (technical lead on Chapel) 

• Cilk: Jim Sukha, Software Engineer at Intel Corp. (in the Cilk Plus development team, 
recommended by Charles E. Leiserson, one of the original Cilk designers) 

• Go: Luuk van Dijk, Software Engineer at Google Inc. (in the Go development team led 
by Andrew Gerrand, and recommended by him) 

• TBB: Arch D. Robison, Sr. Principal Engineer at Intel Corp. (chief architect of TBB) 

After addressing comments from the first round, initial measurements were undertaken. 
The results were forwarded to the experts in a second round, together with links to improved 
implementations and requests for comments on the measurements. Comments from the second 
round were again incorporated. 

The expert review step has the advantage that it produces a set of reference versions of 
the benchmarks, creating a standard that holds across the different languages. Furthermore, it 
allows for measuring, to some degree, the influence of expert knowledge. 



3 Languages 

This section provides the background on the approaches chosen for the experiment: Chapel, 
Cilk, Go, and TBB. Table 1 summarizes their characteristics, together with year of appearance,
and the corporation currently supporting further development. 
Name    Programming abstraction                   Communication paradigm           Programming paradigm          Year  Corporation
Chapel  Partitioned Global Address Space (PGAS)   message passing / shared memory  object-oriented               2006  Cray Inc.
Cilk    Structured Fork-Join                      shared memory                    imperative / object-oriented  1994  Intel Corp.
Go      Communicating Sequential Processes (CSP)  message passing / shared memory  imperative                    2009  Google Inc.
TBB     Algorithmic Skeletons                     shared memory                    C++ library                   2006  Intel Corp.

Table 1: Main language characteristics 



3.1 Chapel 



Chapel [6] describes parallelism in terms of independent computation implemented using threads,
but specified through higher-level abstractions. It won the HPC Most Elegant Language award
in 2011 [15]. Chapel's development effort to date has focused on correctness rather than performance
optimizations;¹ expert comments for Chapel therefore reflect a tension between writing
for performance and writing for clarity.

We use parallel-for as a running example in this section. The concurrent variant of the for 
statement, forall, provides parallel iteration over a range of elements, as shown in Listing 1;
the loop will execute, possibly in parallel, for all the elements between 1 and n (inclusive). 
Control continues with the statement following the forall loop only after every iteration has 
been evaluated. 



forall i in {1..n} {
    work(i);
}



Listing 1: Chapel: parallel-for 

The reduce statement collapses a set of values down to a summary value, e.g. computing the 
sum of the values in an array. The scan statement is similar to the reduction, but stores the 
intermediate reductions in an array, e.g. computing the prefix sum. 

3.2 Cilk 

Cilk exposes parallelism through high-level primitives that are implemented by the runtime 
system, which takes care of load balancing using dynamic scheduling through work stealing. 
The language won the HPC Best Overall Productivity award in 2006 [15]. Cilk's development started
at MIT; since 2009 the technology has been further developed at Intel as Cilk Plus (integration
in commercial compiler, change of keywords, language extensions).²

The keyword cilk_spawn marks the concurrent variant of the function call statement, 
which starts the (possibly) concurrent execution of a function. The synchronization statement 
cilk_sync waits for the end of the execution of all the functions spawned in the body of the 
current function; there is an implicit cilk_sync statement at the end of all procedures. Lastly, 
there is an additional cilk_for construct, see Listing 2.



¹ Personal communication with Brad Chamberlain.

² Personal communication with Charles E. Leiserson and Jim Sukha.
cilk_for (int i = 0; i < n; i++) {
    work(i);
}

Listing 2: Cilk: parallel-for

This construct is a limited parallel variant of the normal for statement, handling only simple 
loops. 

3.3 Go 

Go [11] is a general-purpose programming language targeted towards systems programming.
Parallelism is expressed using an approach based on Communicating Sequential Processes
(CSP) [12].

The statement go starts the execution of a function call as an independent concurrent thread 
of control, or goroutine, within the same address space. Channels (indicated by the chan type) 
provide a mechanism for two concurrently executing functions to synchronize execution and 
communicate by passing a value of a specified element type; channels can be synchronous or 
asynchronous. 

To construct a parallel-for loop, shown in Listing 3, the work gets dispatched to a channel
(index) from one goroutine, while NP goroutines fetch work from this channel and process it.



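index := make(chan int)
done := make(chan bool)
NP := runtime.GOMAXPROCS(0) // query the current setting without changing it

// Producer: dispatch every index to the channel, then close it.
go func() {
    for i := 0; i < n; i++ {
        index <- i
    }
    close(index)
}()

// Workers: NP goroutines fetch indices until the channel is closed.
for i := 0; i < NP; i++ {
    go func() {
        for i := range index {
            work(i)
        }
        done <- true
    }()
}

// Wait for all NP workers to signal completion.
for i := 0; i < NP; i++ {
    <-done
}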

Listing 3: Go: parallel-for 

NP denotes the number of processors or threads that are to be used. To synchronize, each worker
thread will send true through a done channel (done <- true); the main thread waits on this
channel (<-done) for NP values to come across the channel before proceeding, indicating all
workers have completed.

3.4 Threading Building Blocks (TBB) 

Threading Building Blocks (TBB) [19] is a parallel programming template library for the C++
language. Parallelism is expressed using Algorithmic Skeletons [7], and the runtime system takes 
care of scheduling and load balancing using work stealing. 
The function parallel_for performs (possibly) parallel iteration over a range of values, as
shown in Listing 4; the iteration is executed in non-deterministic order.



parallel_for(
    range(0, n),
    [=](range r) {
        for (size_t i = r.begin(); i != r.end(); i++) {
            work(i);
        }
    });



Listing 4: TBB: parallel-for 

The parallel_for function takes as arguments the range over which to iterate, and a lambda
expression that itself will be given a subrange that it may iterate across performing the work. 

The parallel_reduce and parallel_scan functions perform the same parallel operations 
as Chapel's reduce and scan. 



4 Results 

This section presents and discusses the data collected in the experiment as defined in Section 2.

4.1 Preliminaries

Table 2 provides absolute numbers for all versions of the code, before and after expert review.
Note that the size of the chain problem is not necessarily greater than the sum of the sizes of
the subproblems. This is because the intermediary input/output functions were removed and,
whenever possible, code from one problem was reused in the others. Unless stated otherwise, the
discussion of the data in Section 4 refers to the expert-parallel versions, i.e. the parallel versions
obtained after expert review.

To facilitate comparison, all figures display the data in value-normalized form, namely relative
to the smallest/fastest/etc. measurement per problem (which itself gets the value 1.0).

Statistical evaluation. The results are statistically evaluated using the Wilcoxon signed-rank 
test (two-sided variant), a non-parametric test for paired samples. Specifically, for all metrics, 
each language is compared with each other language across all problems. We will say that "A is
significantly different from B" regarding a specific metric if p < 0.05; we will say that "A tends
to be different from B" if 0.05 ≤ p < 0.1.
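The pairwise comparison can be illustrated with a small, self-contained sketch in Python. It computes the exact two-sided p-value of the Wilcoxon signed-rank test by enumerating all sign assignments, which is feasible for six paired samples; this is an illustrative reimplementation, not the tooling used in the study, and the names `signed_rank_p` and `classify` are hypothetical. Zero differences are dropped and tied ranks are averaged, which are assumptions of this sketch.

```python
from itertools import product

def signed_rank_p(a, b):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging the ranks of ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign assignment is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** n

def classify(p):
    """Map a p-value to the wording used in the text."""
    if p < 0.05:
        return "significantly different"
    if p < 0.1:
        return "tends to be different"
    return "no significant difference"
```

With six problems and no ties, the smallest attainable two-sided p-value is 2/64 ≈ 0.031, reached only when one language is better in every problem.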

We will represent the language relationships using graphs, where a solid arrow is drawn from
B to A if A is significantly better than B in a certain metric; a dotted arrow is drawn if A tends
to be better than B. The ordering relations are transitive, but this will not explicitly be shown
in the figures for clarity.

Rating function. The statistical evaluation states the difference of two languages in qualita- 
tive terms, but does not expose the magnitude of this difference. The magnitude is important, 
however, because although two languages are significantly different regarding a certain metric, 
the relative difference might be small enough to be negligible in certain use cases. To address 
this, we define the average relative rating of each language amongst the other languages, for a 
specific metric m: 
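\[
\mathrm{rating}_m(L) = \frac{1}{|\mathcal{P}|} \cdot \sum_{P \in \mathcal{P}} \frac{m(L, P)}{\min_{L' \in \mathcal{L}} m(L', P)}, \qquad L \in \mathcal{L}
\]

where:

$\mathcal{P}$  set of problems

$\mathcal{L}$  set of languages

$m : \mathcal{L} \times \mathcal{P} \to (0, \infty)$  metric function

$\mathrm{rating}_m : \mathcal{L} \to [1, \infty)$  rating function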

For each language, the rating function calculates the average of the language's relative perfor- 
mance in each problem compared to the best performance of any language in the same problem. 
Thus, if the language was the best in all problems in a given metric, the result will be 1.0; a 
value of 2.0 for a given language and metric means that, on average, the language was 2 times 
"worse" (slower or larger, etc., depending on the metric) than the best language for that metric 
in each problem; etc. 
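The rating function can be sketched in a few lines of Python. The nested-dictionary layout (language, then problem, then metric value) and the sample numbers are assumptions made for illustration only; they are not measurements from the experiment.

```python
def rating(metric, language):
    """Average, over all problems, of the language's metric value divided by
    the best (smallest) value any language achieved on the same problem."""
    ratios = []
    for problem, value in metric[language].items():
        best = min(values[problem] for values in metric.values())
        ratios.append(value / best)
    return sum(ratios) / len(ratios)

# Hypothetical metric values (smaller is better), two languages, two problems.
sample_metric = {
    "A": {"p1": 10.0, "p2": 20.0},
    "B": {"p1": 20.0, "p2": 20.0},
}
```

Here language A is best (or tied for best) on both problems and is rated 1.0, while B is twice as large on p1 and equal on p2, giving (2.0 + 1.0) / 2 = 1.5.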



4.2 Source code size 

The graph in Figure 1 shows the relative number of lines of source code (LoC) across all languages
and problems, normalized to the smallest size in each problem.
Figure 1: Source code size (LoC)



Chapel shows the smallest code size in all of the problems, which can be explained by the 
conciseness of its language-integrated parallel directives. All the other languages are typically 
around 1.5-2.0 times larger, relative to Chapel's code size. Go's code size is the largest in all 
of the problems; reasons for this are the space taken for setting up the goroutines whenever 
a parallel operation is needed, and for synchronization with channels. Cilk and TBB hold a 
middle ground and are often comparable in code size. 



Results of the Wilcoxon test and of the rating function are combined in Figure 2 (Section 4.1
explains how to interpret the graph). The placement of a language along the x-axis reflects its
rating according to the rating function.

This confirms statistically the visual interpretation of Figure 1: Chapel provides the most
concise code overall, and Go the largest code size (on average about 2.1 times as large as Chapel's
code). Cilk and TBB are in between, with Cilk tending to have smaller code sizes than TBB
for the same problem.

[Graph: ratings on the x-axis: Chapel 1.0, Cilk 1.5, TBB 1.7, Go 2.1]

Figure 2: Source code size: statistical ordering and rating

4.3 Coding time 

Figure 3 shows the relative time to code for each problem in each language. Note that times
are cumulative in the following way: the coding time in the reference versions (expert-parallel
time) is the sum of the time used to obtain the initial parallel version (parallel time) plus the
time needed to refine it after the expert comments; as the parallel versions were based on the
sequential ones, the parallel coding time itself includes the sequential coding time (sequential and
expert-sequential time, respectively) in all cases. The difference between parallel and expert-parallel
versions is explored in Section 4.6.




[Bar chart over the problems randmat, thresh, winnow, outer, product, chain]

Figure 3: Coding time

In contrast to the lines of code metric, the figure does not suggest any immediate conclusions.
No clear ordering is visible, although TBB seems to have consistently low (but not always
lowest) coding times. This is confirmed by the statistical evaluation, which yields no significant
differences (displayed again as a graph in Figure 4). The individual ratings show a clearer picture:
coding in TBB takes on average only 1.2 times as long as the fastest approach for each problem,
placing it at the top; Go, Cilk, and Chapel take on average at least 2.1 times as long, with Chapel
taking 3.0 times as long.

[Graph: no significant orderings; ratings on the x-axis: TBB 1.2, Go 2.1, Cilk 2.6, Chapel 3.0]

Figure 4: Coding time: statistical ordering and rating
4.4 Execution time



Measurement. The performance tests were run on a 4 x Intel Xeon Processor E7-4830 (2.13 
GHz, 8 cores; total 32 physical cores) server with 256 GB of RAM, running Red Hat Enterprise 
Linux Server release 6.3. Language and compiler versions used were: chapel-1.6.0 with gcc-4.4.6, 
for Chapel; Intel C++ Compiler XE 13.0 for Linux, for both Cilk and TBB; go-1.0.3, for Go. 

Each performance test was repeated 30 times, and the mean of the results was taken. All 
tests use the same inputs, the largest of which is a 4 · 10^4 × 4 · 10^4 matrix (about 12 GB of
RAM). This size, which is the largest input size all languages could handle, was chosen to test 
scalability. The language Go provided the tightest constraint, while the other languages would 
have been able to scale to even larger sizes. 

An important factor in the measurement is that for all problems the I/O time is significant, 
since they involve reading/writing matrices to/from the disk. In order for the measurements to 
not be dominated by I/O, a special flag is_bench was added to every solution. This flag means 
that neither input nor output should occur and that the input matrices should be generated 
on-the-fly instead. All performance tests were run with the is_bench flag set. 

Observations. Figure 5 shows the relative execution time on 32 cores for each language and
problem. The error bars show the 99.9% confidence interval for the mean using Student's t-test. 
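As an illustration of how such error bars can be derived from the 30 repetitions, the following Python sketch computes the mean and a Student's t confidence interval. The critical value 3.659 is the tabulated two-sided 99.9% quantile for 29 degrees of freedom and is hard-coded here as an assumption; the function name `mean_with_ci` is hypothetical and not part of the study's scripts.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed table value: two-sided 99.9% Student's t quantile, 29 degrees of freedom.
T_CRIT_999_DF29 = 3.659

def mean_with_ci(samples, t_crit=T_CRIT_999_DF29):
    """Return (mean, lower bound, upper bound) for 30 repeated measurements."""
    if len(samples) != 30:
        raise ValueError("the hard-coded critical value assumes 30 samples")
    m = mean(samples)
    # Half-width: critical value times the standard error of the mean.
    half_width = t_crit * stdev(samples) / sqrt(len(samples))
    return m, m - half_width, m + half_width
```

A different number of repetitions or a different confidence level would require a different critical value, which is why the sketch rejects other sample sizes.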
Figure 5: Execution time



Chapel took the most time to execute in almost all problems. As mentioned in Section 3.1,
this reflects that the Chapel compiler is missing important optimizations, which were deferred
in order to first give more attention to correctness. Also, all Chapel variables are default-initialized;
in particular, the large matrix in the experiment will be zeroed, causing additional delay. The
distance to the other languages decreases significantly as the input size is decreased, hinting
that the main problem is scalability (Chapel's speedup reaches a plateau early, as
discussed in Section 4.5).



Go shows uneven execution times across the problems, which might be explained by the 
language's lack of maturity (only 3 years old); the performance might show more stable results 
in the future. In particular, the execution times for the chain and outer problems are much higher
than expected; they should be of the same order of magnitude as those of the other subproblems. Chain
additionally has a much higher variance than expected.

TBB and Cilk show consistently low execution times. 
This impression is confirmed statistically, as shown in Figure 6. Both Chapel and Go exhibit
a significantly slower execution time than Cilk and TBB. Considering the rating, TBB and Cilk 
are on par with a score of 1.2, followed by Go at 6.9 and Chapel at 17.0. 

[Graph: ratings on the x-axis: TBB 1.2, Cilk 1.2, Go 6.9, Chapel 17.0]

Figure 6: Execution time: statistical ordering and rating
4.5 Speedup




Figure 7: Speedup per problem 



Measurement. Speedup was measured across 1, 2, 4, 8, 16, and 32 cores, with respect to the 
fastest single thread implementation in the respective language; this is the fastest implementation 
when executed on a single logical thread, i.e. either the sequential version itself, or the parallel 
version restricted to run on a single thread.

\[
\mathrm{speedup}(n) = \frac{T_s}{T_p(n)}
\]

where:

$T_s$  execution time of the fastest single-thread implementation (sequential or parallel)

$T_p(n)$  execution time of the expert-parallel version with n cores

$\mathrm{speedup}(n)$  speedup with n cores

Indeed, we observed across all languages that in about 46% of the cases the single-thread parallel 
version is faster, as parallel restructuring improves running time even on a single thread. 
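The choice of baseline can be made concrete with a short Python sketch. It assumes that the sequential time and the parallel times per core count are already available as plain numbers; the function name `speedups` and the sample timings are hypothetical.

```python
def speedups(t_sequential, t_parallel):
    """Speedup per core count relative to the fastest single-thread run.

    `t_parallel` maps a core count n to the parallel execution time T_p(n)
    and must contain an entry for n = 1.
    """
    # T_s: the faster of the sequential version and the parallel version on one thread.
    t_s = min(t_sequential, t_parallel[1])
    return {n: t_s / t for n, t in sorted(t_parallel.items())}

# Hypothetical timings in seconds: the parallel version is faster even on one thread.
sample = speedups(10.0, {1: 8.0, 2: 4.0, 4: 2.0})
```

In this example the baseline is 8.0 s rather than 10.0 s, so the speedup at 4 cores is 4.0 and not 5.0; using the slower sequential version as baseline would overstate the gain.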

Observations. Figure 7 shows the speedup graphs per problem. The values are accurate
within the 99.9% confidence interval using Fieller's theorem for the calculation of confidence 
intervals for the ratio of two means (error bars would not be visible on the plot). 

For the problems product and outer, the speedup in all languages tends to plateau starting 
from about 16 cores. This can be partly attributed to the fact that the sequential versions 
already take very little time to execute on these problems; the input size would have to be 
further increased (but cannot without losing the ability to compare amongst all approaches, as 
discussed above). In all other problems, at least one language shows good scalability; as the
number of cores increases, the speedup lines fan out, showing that there are significant differences.

Cilk and TBB show good scalability on these problems, with speedups of about 15-20 and 
10-21 on 32 cores, respectively. 

Go's scalability is more uneven: in product and randmat it keeps up with the top performers; 
a plateau is visible in thresh at 16 cores; and speedup deterioration is detected in chain and outer. 
The deterioration might be caused by excessive creation of goroutines, generating scheduling and 
communication overheads. 

Chapel's speedup consistently plateaus early, from around 4-8 cores and at a speedup of
around 2-3 in all problems, but does not specifically underperform in any of them. This shows
the need for an overall improvement in the Chapel compiler's implementation (see discussion
in Section 3.1).

Figure 8 shows the results of the statistical tests and the application of the rating function,
for the speedup at 32 cores. We opted for using the speedup at 32 cores, as it represents the 
best approximation available to the asymptotic speedup. Note that the rating function has to 
be modified slightly: since in the speedup measure larger is better, the inverse of the metric 
value is used. 

Figure 8: (Inverse of) speedup: statistical ordering and rating

Confirming the expectation from the speedup graphs, Chapel shows significantly worse
speedup than Cilk and TBB, and tends to show worse speedup than Go. Cilk and TBB show
no significant difference, while Go tends to show worse speedup than TBB.

4.6 Influence of expert review

In the previous sections, all results were computed from the versions after expert review. This 
section explores in what way the expert review changed the original implementations with respect 
to the four metrics. 

Source code size. The differences between the non-expert and expert versions with respect
to lines of code are given in Figure 9. In general, suggested changes increased or decreased
the source code size by a moderate amount, typically only 10%-20% either way. One way of 
interpreting this result is that expert comments did not change the fundamental complexity of 
the code. 




Figure 9: Source code size (LoC) difference 

A number of observations can be made from the graph in Figure 9. All Cilk solutions have
decreased in size by up to about 20%. This change can be traced back to one of the expert
comments to replace cilk_spawn/cilk_sync style code (see Listing 5) with cilk_for (Listing 2);
according to the expert, cilk_for simplifies the code while doing the same recursive
divide-and-conquer underneath, and should therefore be preferred.



void do_work(int begin, int end) {
    int middle = begin + (end - begin) / 2;
    if (begin + 1 == end) {
        work(begin);                        /* leaf: a single element */
    } else {
        cilk_spawn do_work(begin, middle);  /* left half */
        cilk_spawn do_work(middle, end);    /* right half */
    }
    cilk_sync;                              /* wait for spawned children */
}

cilk_spawn do_work(0, n);

Listing 5: Cilk: divide-and-conquer 

Increases in code size of around 35% are visible in Go randmat and winnow. For randmat, this
can be explained by a suggested change of data structure; since the randmat program is small
to begin with, this relatively small change amounts to a seemingly large increase percentage-wise.
For winnow, the increase stems from the expert's suggestion to add a merge sort to improve
performance; merge sort is not part of Go's standard library, and the original sort did not
parallelize well, resulting in a performance hit.

The increases in size for Chapel can be explained by the hoisting of range/domain definitions,
which would otherwise not be recognized as constants by the C compiler used by Chapel.



Coding time. Figure 10 displays the differences in coding time between the non-expert and 
expert versions. The expert time is the sum of the non-expert time and the time spent on expert 
comments.

Figure 10: Coding time difference



The graph shows that a maximum of 80% of the original coding time was spent on the 
optimizations suggested by the experts; in more than half of the cases, however, it is only up 
to 30% of the original time. This shows that the time spent with expert comments was not too 
extensive. In particular, none of the original problems needed to be rewritten completely, but 
changes were incremental. This is in accordance with the noted changes in source code size. 



Execution time. Figure 11 displays the differences in execution time. As expected,
expert comments reduced execution time in most of the cases.

Figure 11: Execution time difference

For Go, strong improvements were made in most of the cases. This can be attributed
to one important change in the way parallelism was achieved. In the non-expert versions, a
divide-and-conquer pattern of the form displayed in Listing 6 was frequently used. Instead, the
expert recommended the distribute-work-synchronize pattern of Listing 3. While the
divide-and-conquer approach creates one goroutine per row of the matrix, the
distribute-work-synchronize pattern creates one for each processor core; for large matrices, the
overhead of the excessive creation of goroutines then causes a performance hit. In the problem
outer, the Go expert had suggested changing the data structure from a one-dimensional to a
two-dimensional array for clarity, without apparent performance differences on smaller problem
sizes on a desktop machine. In the final measurement, however, this change causes a 63% increase
in execution time in the expert version; this highlights the fact that program optimizations have
to take both the target machine and the target problem size into account.

For Chapel, in the problem thresh, the expert execution time increases by about 70%. The 
expert gave comments on a version that was compiled with version 1.5 of Chapel. After changing 
to version 1.6 (which had appeared in the meantime) for the final measurements, the non-expert 
version experienced a significant reduction in execution time, while the expert version remained 
the same; this illustrates the fragility of compiler optimizations. 



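func do_work(begin, end int, done chan bool) {
    if begin+1 == end {
        // Leaf: a single row; work is expected to signal completion on done.
        work(begin, done)
    } else {
        middle := begin + (end-begin)/2
        go do_work(begin, middle, done) // left half in a new goroutine
        do_work(middle, end, done)      // right half in the current goroutine
    }
}

done := make(chan bool)
go do_work(0, n, done)

// Wait for one completion signal per row.
for i := 0; i < nrows; i++ {
    <-done
}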

Listing 6: Go: divide-and-conquer 
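
For contrast with Listing 6, the distribute-work-synchronize idea described in the text (one goroutine per processor core, each handling a block of rows) might be sketched as follows. This is an illustration under assumptions, not the paper's Listing 3: the names processRows and work, the use of sync.WaitGroup for synchronization, and the choice of runtime.NumCPU() workers are all hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processRows calls work once for every row index in [0, nrows), using a
// fixed number of goroutines that each own a contiguous block of rows.
func processRows(nrows, workers int, work func(row int)) {
	if workers < 1 {
		workers = 1
	}
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		// Distribute: split the rows into `workers` nearly equal blocks.
		begin := w * nrows / workers
		end := (w + 1) * nrows / workers
		wg.Add(1)
		go func(begin, end int) {
			defer wg.Done()
			// Work: process the block sequentially.
			for i := begin; i < end; i++ {
				work(i)
			}
		}(begin, end)
	}
	// Synchronize: wait until every block has been processed.
	wg.Wait()
}

func main() {
	const nrows = 1000
	squares := make([]int, nrows)
	processRows(nrows, runtime.NumCPU(), func(row int) {
		squares[row] = row * row // each row is written by exactly one goroutine
	})
	fmt.Println(squares[10], squares[999])
}
```

The number of goroutines here is bounded by the worker count rather than by the number of rows, which is the property the text credits for the improved execution times.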

Speedup. Figure 12 shows the changes in speedup on 32 cores.

Except for Go, speedup seems to have been influenced little by the expert comments; most
of the time no further speedup (i.e. 1x speedup) is visible. In Go, strong improvements are
visible for randmat, product, and chain. This is most likely caused by the change in concurrency
pattern used, as discussed under the execution time differences. It emphasizes the fact that it
is critical in Go to know about idiomatic patterns to make full use of the performance offered
by the language. A slowdown is visible for the outer problem in Go, which corresponds to the
issue discussed for outer under the execution time difference.

4.7 Threats to validity 

The fact that the individual problems are small threatens external validity, as it is unclear
whether the results generalize to large problems. This threat was mitigated to some degree by
selecting a problem set with diverse tasks that occur in practice, and by evaluating the chain
problem, which emulates a larger program as a chain of smaller ones.

Figure 12: Speedup difference

Problem selection bias, a threat to internal validity, is avoided in part by using an existing 
problem set, instead of creating a new one. The threat that specific problems could be better 
suited to some languages than others remains, as it could already be present in the existing 
problem set. As a positive point, none of the experts criticized the choice of problems for 
evaluating their language. 

Since the languages are based on very different fundamental designs, it is not immediately 
clear whether or not they are actually comparable. But as long as it is possible to solve a 
problem in all languages the metric comparison seems to be meaningful. Again, the experts 
were aware of the competing languages and did not challenge the choice of languages. 

Kennedy et al. [K] note that direct implementation time measurement usually turns out 
to be invalid because it is difficult to factor out individual ability, and because one might be 
comparing novice programmers in one language against experienced ones in another. This is 
not a critical factor in this work, since there is only one developer. Although an experienced
developer in general, he had no previous experience with any of the parallel languages used
in the experiment.

Faulk et al. [9] say that lines of code, as a metric, must be applied cautiously, since the number
of lines of code necessary to implement a particular functionality varies greatly from one
programming language to another, and is also influenced by the quality of the programmer. Using expert
reviews helped address this threat, as the experts were explicitly asked to comment on style. 
Also, having a single experienced developer helps, again, to mitigate the influence of varying 
programming quality. 



4.8 Discussion 

Considering all four metrics together (see the summary of the ratings in Table 3), it becomes
apparent that all four languages have individual strengths and weaknesses.

Chapel has incorporated parallel directives at the language level. This has an apparent
advantage, as the code size is consistently the smallest across all problems: there is a clear
benefit to having language-level support for high-level operations. However, the performance
rates quite low, though this does not appear to be an inherent property of the language, but
rather a consequence of the compiler implementation having focused on other issues (see
discussion in Section 3.1).

Cilk's initial claim to fame was its lightweight tasks, which could quickly be balanced among
many threads. Consequently, the language shows very strong results on the performance-related
measures. Since then Cilk has gained a new keyword (cilk_for), which also gives it some
advantage on the code size metric.

            Source       Coding    Execution    (Inverse of)
            code size    time      time         speedup

  Chapel    1.0*         3.0       17.0         6.4
  Cilk      1.5          2.6       1.2*         1.1*
  Go        2.1          2.1       6.9          2.9
  TBB       1.7          1.2*      1.2*         1.3

Table 3: Ratings for all metrics (smaller is better; best per column marked with *)

Go has been designed for more general forms of concurrency; for example, using channels 
for communication but allowing shared memory where necessary is very flexible. Consequently, 
Go does not have extensive language or library support for structured parallel computations, 
such as fork-join, parallel-for, or parallel-reduce. This is evident in the source code size, which 
is often the largest. Go does an acceptable job on the performance measures, although some 
problems have been detected. Since the language is the youngest in the study (it appeared in 
2009), the compiler is expected to mature in this respect. 

TBB has no language-level support; it is strictly a library approach. However, the library it
provides is the most comprehensive of the four languages, containing algorithmic skeletons, task
groups, synchronization and message passing facilities. The high-level parallel algorithms were
sufficient to implement every task in the benchmark set without dropping down to lower-level
primitives such as manual task creation and synchronization. Together with Cilk, TBB provides
the best performance. Being a library for a well-known language, it also has the fastest coding
times.

Although none of the languages have any mechanisms to ensure freedom from concurrency 
issues such as data races or deadlocks, their common aim is to provide the ability to use built-in 
functionality to make the common cases easy and as safe as they can be. 

5 Related work 

A number of related works present studies comparing approaches to parallel programming, albeit 
for languages different from the ones used in our experiment. 

Szafron and Schaeffer [22] assess the usability of two parallel programming systems (a
message-passing library and a high-level parallel programming system) using a population of 15
students, and a single problem (transitive closure). Six metrics were evaluated: number of 
work hours, lines of code, number of sessions, number of compilations, number of runs, and 
execution time. They conclude that the high-level system is more usable overall, although the 
library is superior in some of the metrics; this highlights the difficulty in reconciling the results 
of different metrics. In contrast to this approach, we report no overall rank; instead we provide 
ranks within a metric, as the suitability of a language may depend on external factors that give 
different weight to each of the metrics. 

Hochstein et al. [13] provide a case study of the parallel programmer productivity of novice
parallel programmers. The authors consider two problems (game of life and grid of resistors)
and two programming models (MPI and OpenMP). They investigate speedup, code expansion
factor, time to completion, and cost per line of code, concluding that MPI requires more effort
than OpenMP overall in terms of time and lines of code. Hochstein et al. [14] compare
programming effort for two parallel programming models (message-passing and PRAM-like), using
one problem (sparse-matrix dense-vector multiplication), with two groups of students. The
metrics used are development time and program correctness. The results show a 46% lower
PRAM-like development time compared to message-passing, and no statistically significant
difference in correctness rates.

Rossbach et al. [20] conducted a study with 237 undergraduate students implementing the
same program with locks, monitors, and transactions. While the students felt on average that
programming with locks was easier than programming with transactions, the transactional
memory implementations had the fewest errors. Ebcioglu et al. [8] measure the productivity of three
parallel programming languages (MPI, UPC, and X10), using 27 students, and a single problem
(Smith-Waterman local sequence matching). For each of the languages, about a third of the
students could not achieve any speedup. The methodology used in our experiment, namely
using an experienced programmer and expert feedback, was able to avoid low-quality solutions.

Nanz et al. [17] present an empirical study with 67 students to compare the ease of use
(program understanding, debugging, and writing) of two concurrency programming approaches
(SCOOP and multi-threaded Java). They use self-study to avoid teaching bias and standard
evaluation techniques to avoid subjectivity in the evaluation of the answers. They conclude
that SCOOP is easier to use than multi-threaded Java regarding program understanding and
debugging, and equivalent regarding program writing. Pankratius et al. [18] compare the
languages Scala and Java using 13 students and one software engineer working on three different
projects. The resulting programs are compared with respect to programmer effort, code
compactness, language usage, program performance, and programmer satisfaction. They conclude
that Scala's functional style does lead to more compact code and comparable performance.

Cantonnet et al. [5] analyze the productivity of two languages (UPC and MPI), using the
metrics of lines of code and conceptual complexity (number of function calls, parameters, etc.),
obtaining results in favor of UPC. Bal [3] presents a practical study based on actual programming
experience with five languages (SR, Emerald, Parlog, Linda and Orca) and two problems
(traveling salesman, all pairs shortest paths). It reports the author's experience while implementing
the solutions.

It is worth noting that all the above studies either use novices as study participants (problem 
with ensuring a high quality of the code), or use implementations of the study authors (problem 
with experimenter bias, if the authors are also the designers of the approaches). In our
experiment, we use a single experienced developer and review by notable experts in the languages,
avoiding these problems.

6 Conclusions 

We presented an experiment comparing four popular approaches to parallel programming,
providing two main contributions. First, we defined a methodology for comparing multicore
languages, involving review by notable experts. We found that this methodology provides robust
results as it ensures consistently high-quality program artifacts. It also enables studying the
process of parallel program development and quantifying the influence of expert knowledge.
Second, applying the experiment to Chapel, Cilk, Go, and TBB provided a detailed comparative
study of usability and performance. The discussion of the differences of the languages was
supported both by statistical tests and a rating function that quantifies these differences. This
provided an unambiguous characterization of the approaches that can help developers choose
among them.

Our methodology can serve as a template for further comparisons, and we plan to apply
it to more languages in the future. The example of Go showed that there is also a need for
benchmarks to evaluate languages with respect to more general concurrency patterns. Defining
such a set of benchmarks is another interesting direction of future work.